MLP

Tao ZOU

2024-01-08

Hypothesis

A machine learning model essentially learns a conditional distribution; models differ in how they construct the conditioning information.

Linear network model

\[ \begin{bmatrix}x_{11}&x_{12}&x_{13}&\dots&x_{1m}\\x_{21}&x_{22}&x_{23}&\dots&x_{2m}\\\vdots&\vdots&\vdots&\ddots&\vdots\\x_{n1}&x_{n2}&x_{n3}&\dots&x_{nm}\end{bmatrix} \begin{bmatrix}w_{11}&w_{12}&\dots&w_{1h}\\w_{21}&w_{22}&\dots&w_{2h}\\w_{31}&w_{32}&\dots&w_{3h}\\\vdots&\vdots&\ddots&\vdots\\w_{m1}&w_{m2}&\dots&w_{mh}\end{bmatrix} \]
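The product above maps \(n\) samples with \(m\) attributes onto \(h\) hidden units. A minimal NumPy sketch (the sizes n=4, m=3, h=2 are arbitrary illustrations):

```python
import numpy as np

n, m, h = 4, 3, 2            # hypothetical sizes: samples, attributes, hidden units
X = np.random.randn(n, m)    # data matrix, one sample per row
W = np.random.randn(m, h)    # weight matrix of one linear layer

O = X @ W                    # linear-layer output, shape (n, h)
print(O.shape)
```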

Mandatory requirements

We hope that the training data matrix and the validation data matrix both represent the same distribution of \(\vec{X}_{1\times m}\).

Therefore, samples in the data matrix should be independent.

  1. If samples within the training data matrix or the validation data matrix are dependent, the effectiveness of predictions on the validation data matrix is at risk.

  2. If the training data matrix and the validation data matrix are dependent on each other, the effectiveness of predictions on the test data matrix is at risk.
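One simple way to spot dependence between consecutive samples is the lag-1 autocorrelation of a feature. A sketch on two hypothetical series, one i.i.d. and one a random walk (strongly dependent):

```python
import numpy as np

rng = np.random.default_rng(0)
iid = rng.normal(size=1000)               # independent samples
walk = np.cumsum(rng.normal(size=1000))   # dependent samples (random walk)

def lag1_autocorr(x):
    """Correlation between x[t] and x[t-1]."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(lag1_autocorr(iid))   # close to 0
print(lag1_autocorr(walk))  # close to 1
```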

Linear networks can only learn linear relationships from predictors to the response variable

The best achievable model performance is determined only by: the raw data, new information, and the linear model itself.

I demonstrate the ideal model’s performance over different situations below.

import numpy as np
import plotly.graph_objs as go
from plotly.subplots import make_subplots

x = np.linspace(0, 5, 1000)
y1 = 1 - np.exp(-3 * x)
y2 = (1 - np.exp(-2 * x)) * 0.8
y3 = (1 - np.exp(-1 * x)) * 0.5

fig = make_subplots(rows=1, cols=1)
fig.add_trace(go.Scatter(x=x, y=y1, mode='markers', name='line1', marker=dict(color='RoyalBlue')))
fig.add_trace(go.Scatter(x=x, y=y2, mode='markers', name='line2', marker=dict(color='red')))
fig.add_trace(go.Scatter(x=x, y=y3, mode='markers', name='line3', marker=dict(color='green')))

fig.add_shape(type="line",
      x0=0, y0=1, x1=1, y1=1,
      xref='paper', yref='y',
      line=dict(color='RoyalBlue', width=3, dash='dash'))
fig.add_shape(type="line",
      x0=0, y0=0.8, x1=1, y1=0.8,
      xref='paper', yref='y',
      line=dict(color='red', width=3, dash='dash'))
fig.add_shape(type="line",
      x0=0, y0=0.5, x1=1, y1=0.5,
      xref='paper', yref='y',
      line=dict(color='green', width=3, dash='dash'))
fig.update_layout(
    title="Model's Performance over Different Situations",
    xaxis_title="Time Cost to Optimal State",
    yaxis_title="Model's Performance"
)
fig.show()

Non-mandatory requirements

Whether the attributes are dependent or independent, the best model performance is unaffected.
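This claim can be checked with ordinary least squares: appending a perfectly dependent (duplicated) column changes the coefficients but not the best achievable fit. A small sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_dep = np.hstack([X, X[:, :1]])  # append a copy of column 0: dependent attributes

# lstsq returns a minimum-norm least-squares solution even when X_dep is rank-deficient
coef1, *_ = np.linalg.lstsq(X, y, rcond=None)
coef2, *_ = np.linalg.lstsq(X_dep, y, rcond=None)

mse1 = np.mean((X @ coef1 - y) ** 2)
mse2 = np.mean((X_dep @ coef2 - y) ** 2)
print(mse1, mse2)  # essentially equal: best performance unaffected
```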

One layer linear network

\[ \begin{bmatrix}x_{11}&x_{12}&x_{13}&\dots&x_{1m}\\x_{21}&x_{22}&x_{23}&\dots&x_{2m}\\\vdots&\vdots&\vdots&\ddots&\vdots\\x_{n1}&x_{n2}&x_{n3}&\dots&x_{nm}\end{bmatrix} \begin{bmatrix}w_{11}\\w_{21}\\w_{31}\\\vdots\\w_{m1}\end{bmatrix} = \begin{bmatrix}o_{11}\\o_{21}\\\vdots\\o_{n1}\end{bmatrix} \]

  1. If the scales of different attributes vary significantly, the steepness of the parameter space will be inconsistent across dimensions, which may cause the loss to oscillate or even increase during training. However, this problem seems to be mitigated in multi-layer networks.
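The scale problem above is commonly handled by standardizing each attribute before training. A sketch on two hypothetical columns with very different scales (like column1 and column7 in the code later):

```python
import numpy as np

rng = np.random.default_rng(0)
# two attributes with very different scales
X = np.column_stack([rng.normal(5, 2, 1000), rng.normal(100, 100, 1000)])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / std  # each column now has mean ~0 and std ~1

print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```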

Multi-layer linear network

Code

Below is a complete piece of code. I can directly modify and run it to experiment with MLPs.

MyDataset

import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from sklearn.model_selection import train_test_split

class MyDataset(Dataset):
  def __init__(self, input_data, input_label, features_transform=None, labels_transform=None):
    self.input_data = input_data
    self.input_label = input_label
    self.features_transform = features_transform
    self.labels_transform = labels_transform

  def __len__(self):
    return len(self.input_data)

  def __getitem__(self, idx):
    feature = self.input_data[idx]
    if self.features_transform:
      feature = self.features_transform(feature)

    label = self.input_label[idx]
    if self.labels_transform:
      label = self.labels_transform(label)

    return feature, label

device = "cuda" if torch.cuda.is_available() else "cpu"
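A quick usage sketch of MyDataset with an optional transform (the scaling lambda is a hypothetical example; the class is repeated in condensed form so the sketch runs standalone):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):  # condensed copy of the class above
    def __init__(self, input_data, input_label, features_transform=None, labels_transform=None):
        self.input_data, self.input_label = input_data, input_label
        self.features_transform, self.labels_transform = features_transform, labels_transform

    def __len__(self):
        return len(self.input_data)

    def __getitem__(self, idx):
        feature, label = self.input_data[idx], self.input_label[idx]
        if self.features_transform:
            feature = self.features_transform(feature)
        if self.labels_transform:
            label = self.labels_transform(label)
        return feature, label

features, labels = torch.randn(8, 2), torch.randn(8)
dataset = MyDataset(features, labels, features_transform=lambda f: f * 2)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

batch_features, batch_labels = next(iter(loader))
print(batch_features.shape, batch_labels.shape)  # torch.Size([4, 2]) torch.Size([4])
```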

Complete code

column1 = np.random.normal(5, 2, size=(10000,)).astype(np.float32)
column2 = np.random.normal(5, 2, size=(10000,)).astype(np.float32)
column3 = np.random.normal(10, 3, size=(10000,)).astype(np.float32)
column4 = np.random.normal(7, 10, size=(10000,)).astype(np.float32)
column5 = np.random.normal(1, 10, size=(10000,)).astype(np.float32)
column6 = np.random.normal(2, 1, size=(10000,)).astype(np.float32)
column7 = np.random.normal(100, 100, size=(10000,)).astype(np.float32)

temp = np.diff(column1)
column1_2 = np.append(temp, temp[-1])
column1_1 = (column1 + 1) * 1.5
column2_1 = column2 ** 2

column_ = 2 * column1_2 - 3 * column2

df = pd.DataFrame({'column1': column1, 'column2': column2, 'column3': column3, 'column4': column4, 'column5': column5, 'column6': column6, 'column7': column7,
           'column1_1': column1_1, 'column2_1': column2_1, 'column1_2': column1_2})
myfeatures = torch.tensor(df.loc[:, ['column1_2', 'column2']].values).float()
mylabels = torch.tensor(column_).float()

train_features, val_features, train_labels, val_labels = train_test_split(myfeatures, mylabels, test_size=0.2, random_state=42)

train_dataset = MyDataset(train_features, train_labels, None, None)
val_dataset = MyDataset(val_features, val_labels, None, None)

model = nn.Sequential(nn.Linear(train_features.shape[1], 4, bias=False),
            nn.ReLU(),
            nn.Linear(4, 4, bias=False),
            nn.ReLU(),
            nn.Linear(4, 1, bias=False))

# model = nn.Sequential(nn.Linear(train_features.shape[1], 1, bias=False))
# def init_weights(m):
#   if type(m) == nn.Linear:
#     nn.init.xavier_uniform_(m.weight)
#     # with torch.no_grad():
#     #   m.weight = nn.Parameter(torch.tensor([[0., 0.]]))
# model.apply(init_weights)

num_epochs = 100
batch_size = 256
lr = 0.001
model.to(device)
criterion = nn.MSELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

myloss = []
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
for epoch in range(num_epochs):
  train_loss = 0
  model.train()
  for batch_features, batch_labels in train_loader:
    batch_features = batch_features.to(device)
    batch_labels = batch_labels.to(device)
    optimizer.zero_grad()
    train_outputs = model(batch_features)
    loss = criterion(train_outputs, batch_labels.reshape(-1, 1))
    train_loss += loss.item()
    loss.backward()
    optimizer.step()
  train_loss /= len(train_loader)

  if epoch % 10 == 0:
    val_loss = 0
    model.eval()
    with torch.no_grad():
      for batch_features, batch_labels in val_loader:
        batch_features = batch_features.to(device)
        batch_labels = batch_labels.to(device)
        val_outputs = model(batch_features)
        loss = criterion(val_outputs, batch_labels.reshape(-1, 1))
        val_loss += loss.item()
    val_loss /= len(val_loader)
    print('epoch {}/{} train loss: {:.2f}, val loss: {:.2f}'.format(epoch, num_epochs, train_loss, val_loss))

The average loss computed in the code chunk above is only approximate, because the last batch usually contains fewer samples than batch_size, yet every batch's mean loss is weighted equally.
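An exact per-sample average can be obtained by summing the loss within each batch and dividing by the total sample count at the end. A sketch with a hypothetical dataset of 1000 samples (so the last batch holds 232, not 256):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

features, labels = torch.randn(1000, 2), torch.randn(1000, 1)
loader = DataLoader(TensorDataset(features, labels), batch_size=256)
model = nn.Linear(2, 1)
criterion = nn.MSELoss(reduction='sum')  # sum within each batch instead of mean

total_loss, total_samples = 0.0, 0
with torch.no_grad():
    for batch_features, batch_labels in loader:
        total_loss += criterion(model(batch_features), batch_labels).item()
        total_samples += batch_labels.shape[0]

exact_mean_loss = total_loss / total_samples  # exact average over all 1000 samples
print(exact_mean_loss)
```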